Poster: njwhite Date: Feb 11, 2014 9:08am
Forum: texts Subject: Using Tesseract to improve OCR for some languages

I've been using and improving Tesseract OCR for some time; in particular, I developed a good training file for OCR of Ancient Greek (now part of the main Tesseract distribution).

The current OCR for Ancient Greek books on archive.org is unusable, as the text seems to be treated as Latin (though ABBYY doesn't support Ancient Greek anyway).

It would be great if the OCR process used the language metadata to select which OCR engine to use, and in the case of Ancient Greek chose Tesseract.
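
A minimal sketch of what that selection could look like, assuming the language metadata arrives as a plain string. The function names, the mapping table, and the "abbyy" fallback label are all hypothetical and not taken from archive.org's actual pipeline; the only Tesseract-specific details are the `grc` language code for Ancient Greek and the `-l` option of the `tesseract` command line.

```python
# Hypothetical table: language metadata values that should be routed to
# Tesseract, mapped to the Tesseract language code to use.
TESSERACT_LANGS = {
    "grc": "grc",
    "greek, ancient": "grc",
    "ancient greek": "grc",
}


def choose_ocr(language):
    """Return an (engine, language_code) pair for a language metadata value."""
    key = language.strip().lower()
    if key in TESSERACT_LANGS:
        return ("tesseract", TESSERACT_LANGS[key])
    # Assumption: everything else stays with the existing default engine.
    return ("abbyy", None)


def tesseract_command(image_path, output_base, lang):
    """Build the argument list for a Tesseract command-line run."""
    return ["tesseract", image_path, output_base, "-l", lang]


engine, lang = choose_ocr("Greek, Ancient")
if engine == "tesseract":
    # Prints the command only; pass it to subprocess.run() to execute it.
    print(" ".join(tesseract_command("page.png", "page", lang)))
```

The command is only printed here, so the sketch runs without Tesseract installed.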

Is that something that's being worked on? I'd be very happy to help however I can, if anybody is interested. I'm also very active on the tesseract-ocr mailing list if anyone wants to contact me there.

Nick White

This post was modified by njwhite on 2014-01-29 14:56:17

This post was modified by njwhite on 2014-02-11 17:08:48

Poster: Jeff Kaplan Date: Jan 29, 2014 9:38pm
Forum: texts Subject: Re: Using Tesseract to improve OCR for some languages

Ancient Greek is one of the languages we can OCR using our Abbyy module.

Poster: njwhite Date: Jan 30, 2014 1:36am
Forum: texts Subject: Re: Using Tesseract to improve OCR for some languages

Oh really? I didn't know they had pre-built modules for that (the http://finereader.abbyy.com/corporate/tech_specs/ page doesn't mention it). Note that Ancient Greek is quite different to modern Greek (more diacritics, different vocabulary).

Is there some way you can automatically select the language to OCR? As I mentioned, at present all the Ancient Greek books I've looked at in the archive appear to have been treated as Latin, with unusable results.

Poster: aibek Date: Jan 30, 2014 7:19pm
Forum: texts Subject: Re: Using Tesseract to improve OCR for some languages

To see what Jeff meant by ‘Abbyy module’ see the attached extract from a log file.

I suggest that you upload a PDF file containing Ancient Greek exclusively, in the manner suggested by Jeff, and check what happens.

Attachment: Module-AbbyyXML.txt

Poster: tfmorris Date: Feb 11, 2014 8:56am
Forum: texts Subject: Re: Using Tesseract to improve OCR for some languages

Does anyone look at improving OCR quality on an ongoing basis, whether through the use of Tesseract or other means? Is OCR ever re-done after the initial pass?

I recently did a study scoring the OCR quality of public domain eBooks in IA and found the quality to be all over the map. I suspect that Tesseract could do a better job in many cases, but I also suspect that the ABBYY results could be improved as well.

I saw some anecdotal evidence that high processing loads on the OCR cluster and the use of "fast mode" were correlated with low OCR quality. That seems fine as an interim measure if the books were later requeued for full processing, but that doesn't seem to happen.

Like Nick, I'd be willing to help with improving the OCR quality.

Tom

Poster: Jeff Kaplan Date: Jan 30, 2014 1:32pm
Forum: texts Subject: Re: Using Tesseract to improve OCR for some languages

Yes, if you use our HTML5 uploader with Chrome, Firefox or Safari at archive.org/upload, you will see a language cell. Click in it and the dropdown menu includes "Greek, Ancient".

Poster: susannamore53 Date: May 17, 2015 9:09pm
Forum: texts Subject: Re: Using Tesseract to improve OCR for some languages

Hi, I have found customizable OCR software that supports various languages.

.net ocr api
.net ocr library open source

This post was modified by susannamore53 on 2015-05-18 04:09:51

Poster: HanmoLingfeng Date: Aug 12, 2014 7:52pm
Forum: texts Subject: Re: Using Tesseract to improve OCR for some languages

Hi, guys. I need to create my own traineddata for an Android app, but my app always crashes when I use my traineddata file. Could you give me some help? Thanks for your help.
Added:
If anyone could give me some advice, please contact me. My work is blocked by this problem. Thanks for your kindness.
My e-mail: flynigege@gmail.com.

Poster: njwhite Date: Aug 13, 2014 7:50am
Forum: texts Subject: Re: Using Tesseract to improve OCR for some languages

HanmoLingfeng, this is totally the wrong place to ask for that kind of support. Try the Tesseract OCR mailing list: http://groups.google.com/group/tesseract-ocr/

Poster: HanmoLingfeng Date: Aug 13, 2014 6:48pm
Forum: texts Subject: Re: Using Tesseract to improve OCR for some languages

OK. Thank you, guys.